Web Information Extraction by Semantic Tagging

Authors

Mirel Cosulschi

Roberto De Virgilio

Tommaso Di Noia

Roberto Mirizzi

Abstract

An important aspect of research on Web information extraction concerns inferring complex reasoning and correlations from information distributed across many different Web data sources. Once the semantics of the information and services available on the Web are defined, the World Wide Web becomes a vast store of information that computer applications can process easily. The Semantic Web aims to create a universal medium through which applications can exchange data and knowledge. In this framework, we extend a structure discovery technique [1] that (i) identifies blocks grouping semantically related objects occurring in Web pages, and (ii) generates a logical schema of a Web site with the support of semantic tagging.

Very often a Web page does not relate to a single semantic topic. Decomposing a Web page into smaller, semantically annotated fragments would support more accurate results for Semantic Web searches, richer data integration, and a better navigation experience. On the one hand, if the original Web page is already annotated, the page segmentation process can use the annotation together with the visual and structural information. On the other hand, when no annotation is available (the most frequent case on the current Web), the page has to be decomposed by a segmentation process, and each extracted page block then has to be annotated. With automatic annotation, the latter approach could make the process easier because it reduces the size of the input data.

Using a Data Reverse Engineering process [1], the logical schema of the Web site can be obtained from the conceptual representation, and the extracted HTML blocks can then be labelled using RDFa (e.g. an HTML block with paper details such as title and author). Figure 1 sketches the process.
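
As an illustration of labelling an extracted block with RDFa, the following is a minimal sketch rather than the authors' implementation. It assumes RDFa 1.1 attributes (`vocab`, `typeof`, `property`) and the Dublin Core terms vocabulary, neither of which is named in the abstract; the function name `label_paper_block` is hypothetical.

```python
from html import escape

# Assumption: Dublin Core terms as the vocabulary; the abstract names none.
DC_TERMS = "http://purl.org/dc/terms/"


def label_paper_block(title, authors):
    """Return an HTML block whose title and authors carry RDFa properties."""
    lines = [f'<div vocab="{DC_TERMS}" typeof="BibliographicResource">']
    # The paper title becomes a dcterms:title property.
    lines.append(f'  <h2 property="title">{escape(title)}</h2>')
    # Each author becomes a separate dcterms:creator property.
    for name in authors:
        lines.append(f'  <span property="creator">{escape(name)}</span>')
    lines.append("</div>")
    return "\n".join(lines)


block = label_paper_block(
    "Web Information Extraction by Semantic Tagging",
    ["Mirel Cosulschi", "Roberto De Virgilio"],
)
print(block)
```

Text values are escaped before insertion so that characters such as `&` in a title cannot break the surrounding markup.
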
The semi-automated annotation of a page fragment might exploit both named-entity extraction techniques and the availability of a huge base of shared and interlinked data, the so-called Web of Data. The first step towards interpreting the information contained in an extracted Web block is the identification of named entities [3] in the block body. These entities may be automatically classified with respect to a shared generic ontological schema (e.g. DBpedia.org, OpenCyc.org or FreeBase.com) or to highly specialized and contextualized ones (see MusicBrainz.org as an example). Information within a block might refer not only to named entities but also to generic concepts. In these cases, the use of a semantically enhanced vocabulary such as WordNet might help in the identification of the main concepts

Similar resources

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is the process of automatically producing a structured representation from unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP), intensified by the growing volume, heterogeneity, and unstructured form of information. One of the core information extraction tasks is relation extraction wh...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

SEIMCHA: a new semantic image CAPTCHA using geometric transformations

As the protection of web applications becomes more important every day, CAPTCHAs are receiving growing attention from both users and designers. It is now well accepted that using visual concepts enhances the security and usability of CAPTCHAs. There are a few major ideas for designing image CAPTCHAs. Some methods apply a set of modifications such as rotations to the original imag...

Linguistic Annotation for the Semantic Web

Establishing the semantic web on a large scale implies the widespread annotation of web documents with ontology-based knowledge markup. For this purpose, tools have been developed that allow for semi-automatic annotation of web documents with ontology-based metadata. However, given that a large number of web documents consist either fully or at least partially of free text, language technology ...

Ontea: Platform for Pattern Based Automated Semantic Annotation

Automated annotation of web documents is a key challenge of the Semantic Web effort. Semantic metadata can be created manually or using automated annotation or tagging tools. The automated semantic annotation tools with the best results are built on various machine learning algorithms, which require training sets. Another approach is to use pattern-based semantic annotation solutions built on natural lang...

Towards Large Scale Semantic Annotation Built on MapReduce Architecture

Automated annotation of web documents is a key challenge of the Semantic Web effort. Web documents are structured, but their structure is understandable only to humans, which is the major problem of the Semantic Web. The Semantic Web can be exploited only if the metadata understood by computers reaches critical mass. Semantic metadata can be created manually, using automated annotation or tagging too...

Journal: RITA

Volume: 16

Publication date: 2009
